Bilateral Denoising Diffusion Models (BDDMs)

Abstract: Denoising diffusion models were designed with a simple forward process, but this design makes efficient sampling difficult. Instead of striving for an accelerated sampler, we propose new bilateral denoising diffusion models (BDDMs) that parameterize the forward and reverse processes, with a score network and a scheduling network, respectively. From a bilateral modeling objective, we derive a tighter lower bound on the likelihood and use it as a surrogate objective, which yields faster and higher-quality generation than other cutting-edge samplers. In particular, with negligible training overhead, the proposed BDDMs generate significantly higher-quality samples with a 62x inference speedup relative to denoising diffusion probabilistic models (DDPMs).

Fast and high-fidelity speech generation using a 7-step noise schedule estimated by BDDMs:

Note: Consider reducing the volume for the first few iterations below as they are mostly white noise.

Text: Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition.

Reverse Step 1: LS-MSE=3393, MCD=6.55, STOI=0.405, PESQ=0.641
Reverse Step 2: LS-MSE=2706, MCD=6.19, STOI=0.493, PESQ=0.681
Reverse Step 3: LS-MSE=1805, MCD=5.58, STOI=0.646, PESQ=1.06
Reverse Step 4: LS-MSE=1080, MCD=4.86, STOI=0.813, PESQ=1.58
Reverse Step 5: LS-MSE=584, MCD=4.01, STOI=0.928, PESQ=2.23
Reverse Step 6: LS-MSE=284, MCD=3.10, STOI=0.973, PESQ=2.87
Reverse Step 7: LS-MSE=77.2, MCD=1.94, STOI=0.984, PESQ=4.02

Comparing Convergence of BDDM's Predicted Schedule Against Linear Schedule:

Remark: Only the last few steps in the linear schedule appear to be significant, while many of the other moves are trivial in the cepstral domain. This supports the assumption behind BDDMs that the number of sampling steps can be reduced significantly by selecting a proper noise scale (step size) for each score (gradient), while maintaining comparable or even higher generation quality.
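To make the role of the per-step noise scale concrete, the following is a minimal sketch of a standard DDPM-style ancestral sampler that runs over a short schedule. It is not the BDDM training or scheduling procedure: the 7 values in `betas` are hypothetical numbers chosen for illustration (not the schedule predicted by BDDM), and `make_oracle` is a hypothetical stand-in for a trained score network that simply knows the clean signal.

```python
import numpy as np

def reverse_sample(eps_model, betas, shape, rng):
    """Run DDPM-style reverse steps over a (possibly very short) noise schedule."""
    betas = np.asarray(betas, dtype=np.float64)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)       # cumulative signal-retention factor
    x = rng.standard_normal(shape)        # start from white noise
    for n in range(len(betas) - 1, -1, -1):
        eps = eps_model(x, n)             # predicted noise at step n
        # Posterior mean: the step size is set by the noise scale betas[n].
        mean = (x - betas[n] / np.sqrt(1.0 - alpha_bars[n]) * eps) / np.sqrt(alphas[n])
        if n > 0:
            var = (1.0 - alpha_bars[n - 1]) / (1.0 - alpha_bars[n]) * betas[n]
            x = mean + np.sqrt(var) * rng.standard_normal(shape)
        else:
            x = mean                      # no noise is added at the final step
    return x

def make_oracle(x0, betas):
    """Hypothetical stand-in for a score network: returns the exact noise for a known x0."""
    alpha_bars = np.cumprod(1.0 - np.asarray(betas, dtype=np.float64))
    def eps_model(x, n):
        return (x - np.sqrt(alpha_bars[n]) * x0) / np.sqrt(1.0 - alpha_bars[n])
    return eps_model

# Hypothetical 7-step schedule, ordered from the smallest to the largest noise scale.
betas = [1e-4, 1e-3, 1e-2, 5e-2, 0.2, 0.5, 0.9]
x0 = np.linspace(-1.0, 1.0, 16)           # toy "clean" signal
sample = reverse_sample(make_oracle(x0, betas), betas, x0.shape, np.random.default_rng(0))
```

With a perfect noise predictor, even this short schedule recovers the toy signal, so the choice of noise scales only matters because a real score network is imperfect. That is the setting in which a well-chosen schedule can make a handful of steps sufficient.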

LJ speech samples from different generative diffusion models:

Note: Different rows correspond to different noise schedules or sampling methods for inference.

Text: and having, quote, somewhat bushy, end quote, hair. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President. | since a disclosure of such detailed information relating to protective measures might undermine present methods of protecting the President.
Ground Truth
DDPM - 8 steps (Grid Search)
DDPM - 1000 steps (Linear)
DDIM - 8 steps (Linear)
DDIM - 100 steps (Linear)
NE - 8 steps (Linear)
Ours BDDM - 8 steps

VCTK samples from different generative diffusion models:

Note: Different rows correspond to different noise schedules or sampling methods for inference.

Text: Frankly, we should all have such problems. | I felt he was excellent. | Frankly, we should all have such problems.
Ground Truth
DDPM - 8 steps (Grid Search)
DDPM - 1000 steps (Linear)
DDIM - 8 steps (Linear)
DDIM - 100 steps (Linear)
NE - 8 steps (Linear)
Ours BDDM - 8 steps

CIFAR-10 samples generated from BDDM:

BDDM - 10 steps | BDDM - 20 steps | BDDM - 100 steps